fix(duplication): prevent potential crash when removing or pausing duplication by lengyuexuexuan · Pull Request #2368 · apache/incubator-pegasus

lengyuexuexuan · 2026-02-12T08:09:22Z

What problem does this PR solve?

#2211

What is changed and how does it work?

Background
During duplication of a table, if dup remove or dup pause is executed, or a balance operation runs at the same time, a node may occasionally core dump with signal 11 (SIGSEGV). The crash locations vary, but they share one trait: they all occur during memory allocation or deallocation.
Analysis
Based on extensive testing, the following conclusions can be drawn:
a. The issue only reproduces when there is write traffic. With write traffic, the ship and load_private_log tasks are added; without it, they are not.
b. The core dump occurs during the execution of cancel_all().
c. The issue occurs with low probability (approximately 1 in 100).

Through analysis using ASAN (AddressSanitizer):
dup_remove_asan.txt
Based on ASAN analysis, the following conclusions can be drawn:
a. The memory corruption occurs during the ship process: the mutations obtained from replaying the plog are passed to ship, and the corrupted memory belongs to them.
b. _load_mutations is captured by a lambda expression that is then stored in a std::function. Because std::move is used, the lifetime of _load_mutations is tied to that of the std::function.
c. cancel_all() runs in the default thread pool and calls the function below. Assigning nullptr to the std::function destroys the stored lambda and releases the memory it manages.

incubator-pegasus/src/task/task.h

Line 341 in e64faa7

void clear_non_trivial_on_task_end() override { _cb = nullptr; }

d. However, each task executes exec_internal() in its own thread pool, and eventually calls release_ref(), which results in delete this.

incubator-pegasus/src/task/task.cpp

Line 224 in e64faa7

this->release_ref(); // added in enqueue(pool)
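
Points b and c can be illustrated with a minimal sketch, which is not taken from the Pegasus code. It assumes the mutations live in a `std::shared_ptr`, and `mutation_list`, `capture_demo_result`, and `run_capture_demo` are hypothetical names introduced only for illustration. It shows that once a buffer is moved into a lambda stored in a `std::function`, assigning `nullptr` to that `std::function` frees the buffer.

```cpp
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for the mutations loaded from the private log.
using mutation_list = std::vector<int>;

struct capture_demo_result
{
    bool alive_before_clear;
    bool alive_after_clear;
};

capture_demo_result run_capture_demo()
{
    auto mutations = std::make_shared<mutation_list>(mutation_list{1, 2, 3});
    std::weak_ptr<mutation_list> observer = mutations;

    // After the move, the closure stored in `cb` is the only owner of the buffer.
    std::function<void()> cb = [m = std::move(mutations)]() { (void)m->size(); };

    capture_demo_result result{};
    result.alive_before_clear = !observer.expired();

    // Same shape as `_cb = nullptr`: destroys the closure and everything it captured.
    cb = nullptr;
    result.alive_after_clear = !observer.expired();
    return result;
}
```

Calling `run_capture_demo()` reports the buffer as alive before the assignment and released after it. If two threads perform that destruction on the same `std::function` without synchronization, the captured memory can be freed twice.
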
Conclusion
1. Both task.cancel() and task.exec_internal() destruct the std::function member inside the task object. These two operations are executed in different threads, and there is no mechanism in place to prevent race conditions between them. As a result, it is possible for both threads to attempt to destruct the same std::function, which can lead to a double deletion of the memory associated with _load_mutations. This ultimately causes memory corruption.
2. _duplications is accessed without proper synchronization in certain functions under multi-threaded scenarios, potentially causing race conditions.
Solution
1. Guard the _cb callback with a lock so that only one thread ever runs its destructor.
2. Take the lock in the functions that previously accessed _duplications without synchronization, to prevent concurrent access conflicts.
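
As a rough illustration of solution 1, here is a minimal sketch under stated assumptions; it is not the code in this PR. `guarded_task`, `clear_callback`, `_cb_lock`, `destruction_counter`, and `count_destructions_under_race` are hypothetical names. Two threads race to clear the same callback, and the mutex ensures the stored closure is destroyed exactly once.

```cpp
#include <atomic>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical, simplified stand-in for a task that owns a callback.
class guarded_task
{
public:
    explicit guarded_task(std::function<void()> cb) : _cb(std::move(cb)) {}

    // Safe to call from both the cancelling thread and the executing thread.
    void clear_callback()
    {
        std::function<void()> doomed;
        {
            std::lock_guard<std::mutex> guard(_cb_lock);
            doomed.swap(_cb); // _cb is now empty; only one caller receives the closure
        }
        // `doomed` is destroyed here, after the lock has been released.
    }

private:
    std::mutex _cb_lock;
    std::function<void()> _cb;
};

// Increments a counter each time the captured state is destroyed.
struct destruction_counter
{
    explicit destruction_counter(std::atomic<int> *c) : count(c) {}
    ~destruction_counter() { count->fetch_add(1); }
    std::atomic<int> *count;
};

int count_destructions_under_race()
{
    std::atomic<int> destroyed{0};
    {
        auto state = std::make_shared<destruction_counter>(&destroyed);
        guarded_task task([s = std::move(state)]() { (void)s; });
        std::thread canceller([&task]() { task.clear_callback(); });
        std::thread executor([&task]() { task.clear_callback(); });
        canceller.join();
        executor.join();
    }
    return destroyed.load();
}
```

Swapping the callback out under the lock and destroying it afterwards is a design choice of this sketch: it keeps the critical section short even when the captured state is expensive to free. Assigning nullptr while holding the lock would serialize the two destructions just as well.
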

Tests

Manual test
The changes have been production-validated at Xiaomi, running stably on more than 30 clusters for over six months, confirming that they resolve the concurrency issues described above.

empiredan · 2026-02-24T03:01:05Z

Hi @lengyuexuexuan Thank you for your contribution!

Please modify the code according to the suggestions provided by Clang Tidy and IWYU.

lengyuexuexuan · 2026-02-26T11:50:13Z

> Hi @lengyuexuexuan Thank you for your contribution!
>
> Please modify the code according to the suggestions provided by Clang Tidy and IWYU.

done

empiredan · 2026-03-06T11:12:00Z

src/replica/duplication/replica_duplicator_manager.cpp

        return;
    }

-    zauto_lock l(_lock);


Does _primary_confirmed_decree below no longer need protection by _lock?

acelyc111-bot

Review: Prevent potential crash when removing/pausing duplication

Summary: Fixes race conditions in replica_duplicator_manager by adjusting lock placement. Also adds thread safety to raw_task callback clearing.

What's good:

Moving lock acquisition earlier in sync_duplication prevents accessing shared state before protection
Removing the redundant lock in update_confirmed_decree_if_secondary is correct since remove_all_duplications now acquires the lock itself
The raw_task destructor + clear_non_trivial_on_task_end lock prevents use-after-free on the callback during concurrent task completion

Concerns:

1. Lock scope change in sync_duplication: The lock now covers the entire function, including the early return path and all the logic below. Previously only the mutation of _duplications was locked. This is correct for safety but increases the lock hold time. If sync_duplication is called frequently or does heavy work, this could cause contention. Worth checking if there are any blocking calls inside.
2. Double lock possibility: update_confirmed_decree_if_secondary calls remove_all_duplications(), which now acquires _lock. If any caller of update_confirmed_decree_if_secondary already holds _lock, this would deadlock. Please verify there are no such call chains.
3. raw_task lock naming: _lock is very generic. Consider _cb_lock or _callback_lock for clarity.

Verdict: ⚠️ Request Changes — Please verify no double-locking path exists for update_confirmed_decree_if_secondary → remove_all_duplications.
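
One common way to rule out the double-lock concern by construction is to split each operation into a public function that takes the lock and a private helper that assumes the lock is already held. The sketch below is an assumption-laden illustration, not the PR's code: `dup_manager_sketch`, `remove_all_unlocked`, `update_confirmed`, and the `long` decree type are hypothetical, and it uses a non-recursive `std::mutex`. Whether Pegasus's `_lock` is recursive is not stated in this thread.

```cpp
#include <cstddef>
#include <map>
#include <mutex>

// Hypothetical, simplified stand-in for a manager guarding shared state with one lock.
class dup_manager_sketch
{
public:
    void add(int dupid)
    {
        std::lock_guard<std::mutex> guard(_lock);
        _duplications[dupid] = 0;
    }

    // Public entry point: acquires the lock itself.
    void remove_all()
    {
        std::lock_guard<std::mutex> guard(_lock);
        remove_all_unlocked();
    }

    // Acquires the lock once; both the removal and the decree update are covered.
    void update_confirmed(long confirmed)
    {
        std::lock_guard<std::mutex> guard(_lock);
        // Calling remove_all() here would relock a non-recursive mutex on the
        // same thread, so the unlocked helper is used instead.
        remove_all_unlocked();
        if (confirmed > _confirmed) {
            _confirmed = confirmed;
        }
    }

    std::size_t size() const
    {
        std::lock_guard<std::mutex> guard(_lock);
        return _duplications.size();
    }

    long confirmed() const
    {
        std::lock_guard<std::mutex> guard(_lock);
        return _confirmed;
    }

private:
    // Precondition: the caller holds _lock.
    void remove_all_unlocked() { _duplications.clear(); }

    mutable std::mutex _lock;
    std::map<int, int> _duplications;
    long _confirmed = -1;
};
```

With this layout, no function that already holds the lock calls another locking function, and the confirmed decree stays under the same lock as the duplication map.
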

lixuejian7 added 2 commits February 12, 2026 15:34

fix(duplication): fix race condition in _duplications access

f0acd67

fix(duplication): fix potential core of duplication during primary re…

e2349bb

…plica migration

github-actions bot added the cpp label Feb 12, 2026

style: format task.h with clang-format

871b37d

lixuejian7 added 2 commits February 26, 2026 16:32

style: fix task.h

15f1dfa

style: fix replica_duplicator_manager.cpp

da7b188

empiredan reviewed Mar 6, 2026

acelyc111-bot reviewed Mar 15, 2026

lengyuexuexuan wants to merge 5 commits into apache:master from lengyuexuexuan:fix_duplication

3 participants
